cover_Do they like me.png

Capstone_1_cover.png

capstone_slide_2.png

Capstone_slide_3 (1).png

overview_of_data.png

Google_Trends_-_2019-12-11_20.18.44.png

keyword data science.png

Hypothesis-1.png

2012-2014 Interest by month.png

When we first gathered and looked at our data,

  • we had interest level broken down by month only...

Procedure We Used:

We needed to convert the data to a different type, add some columns, manipulate the data, and then visualize and test the results.

    - Python, NumPy, SciPy, and pandas:
        - filtering, grouping, sorting
        - aggregations, descriptive stats
        - tests for statistical analysis
    - feature design:
        - year (2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019)
        - month (numerical: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
        - month (name: Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec)
        - weekday (numerical: 0, 1, 2, 3, 4, 5, 6)
        - weekday (name: Mon, Tue, Wed, Thu, Fri, Sat, Sun)
        - quarter (financial: 1, 2, 3, 4)
    - visualize the discrete features (new columns) as y-axis variables:
        - Matplotlib and Seaborn libraries
            - histograms, bar charts, line charts
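A minimal sketch of that feature design, assuming the Trends data carries a datetime column named `date` (the frame and column names here are illustrative, not the notebook's exact schema):

```python
import pandas as pd

# Hypothetical stand-in for the Trends data: a date column plus one
# interest column. The real dataset has 96 monthly observations.
df = pd.DataFrame(
    {"date": pd.to_datetime(["2012-01-02", "2015-07-15", "2019-09-30"]),
     "data_science": [21, 38, 55]}
)

# Derive the discrete time features listed above from the date column.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month                  # 1-12
df["month_name"] = df["date"].dt.strftime("%b")    # Jan-Dec
df["weekday"] = df["date"].dt.weekday              # 0=Mon ... 6=Sun
df["weekday_name"] = df["date"].dt.strftime("%a")  # Mon-Sun
df["quarter"] = df["date"].dt.quarter              # 1-4

print(df[["year", "month", "weekday", "quarter"]])
```

Each feature is a single `.dt` accessor call, which is why these columns are cheap to add before grouping and plotting.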

stats Copy.png

stats Copy (1).png

stats.png

In [445]:
stats1 = df.groupby(['weekday_name'], as_index=False).agg({'data_science': "mean"})
stats2 = df.groupby(['weekday_name'], as_index=False).agg({'data_scientist': "mean"})

# print(round(stats1.sort_values('data_science'), 2))


# Pass an existing Axes to DataFrame.hist; otherwise it opens its own
# figure and plt.figure() leaves an empty one behind.
fig, ax = plt.subplots(figsize=(15, 16))
stats1.hist(bins=50, ax=ax)
ax.set_title('Distribution of Weekday Averages - Data Science\n')
plt.show()
In [446]:
# print(round(stats2.sort_values('data_scientist'), 2))


fig, ax = plt.subplots(figsize=(20, 20))
stats2.hist(bins=50, ax=ax)
ax.set_title('Distribution of Weekday Averages - Data Scientist')
plt.show()
In [447]:
In [447]:
stats = df.groupby(['weekday_name'], as_index=False).agg({'data_scientist': "mean", 'data_science': "mean"})

# Sort the whole frame by the value column so each weekday label stays
# paired with its own mean (sorting x and y separately shuffles the pairs).
stats = stats.sort_values('data_science')
x = stats['data_science']
mean1 = x.mean()


plt.figure(figsize=(17,6))

sns.barplot(x='data_science', y='weekday_name', data=stats)
plt.title('\nAverage Interest Over Time, By Weekday: 2012-2019\nKeyword Term: Data Science (US)\n\n (n=96\nmean=42.8129\nmedian=43.53\nstdv=1.85\nmin=40.00\nmax=44.846)\n')
plt.ylabel("")
plt.xlabel("\nInterest Level (0-100)")
plt.axvline(x=mean1, color='black', linestyle='--')
plt.axvline(x=x.median(), color='blue', linestyle='--')
# mark one standard deviation around the mean (drawing the raw stdv value
# itself would put the line near zero on an interest axis)
plt.axvline(x=mean1 - x.std(), color='yellow', linestyle='--')
plt.axvline(x=mean1 + x.std(), color='yellow', linestyle='--')
plt.axvline(x=x.max(), color='red', linestyle='--')
plt.axvline(x=x.min(), color='cyan', linestyle='--')

plt.show()

# print(x.median())
# print(x.sort_values())

# stats['data_science'].mean()   -> 42.8129
# stats['data_scientist'].mean() -> 46.5744

# print(x.min())
# print(x.max())
In [448]:
In [448]:
stats = df.groupby(['weekday_name'], as_index=False).agg({'data_scientist': "mean", 'data_science': "mean"})

# Sort the frame itself so labels and values stay paired.
stats = stats.sort_values('data_scientist')
x1 = stats['data_scientist']
mean2 = x1.mean()


plt.figure(figsize=(17,6))

sns.barplot(x='data_scientist', y='weekday_name', data=stats)
plt.title('\nAverage Interest Over Time, By Weekday: 2012-2019\nKeyword Term: Data Scientist (US)\n\n (n=96\nmean=46.5744\nmedian=46.0769\nstdv=1.85\nmin=43.92\nmax=49.23)\n')
plt.ylabel("")
plt.xlabel("\nInterest Level (0-100)")
plt.axvline(x=mean2, color='black', linestyle='--')
plt.axvline(x=x1.median(), color='blue', linestyle='--')
# one standard deviation around the mean
plt.axvline(x=mean2 - x1.std(), color='yellow', linestyle='--')
plt.axvline(x=mean2 + x1.std(), color='yellow', linestyle='--')
plt.axvline(x=x1.max(), color='red', linestyle='--')
plt.axvline(x=x1.min(), color='cyan', linestyle='--')

# plt.legend(('mean', 'median', 'stdv', 'max', 'min'), loc='center')

plt.show()

# print(x1.median())
# print(x1.sort_values())

# stats['data_science'].mean()   -> 42.8129
# stats['data_scientist'].mean() -> 46.5744
In [449]:
from scipy import stats  # note: this import shadows the `stats` DataFrame above

plt.figure(figsize=(17,6))

plt.hist(df['data_science'], alpha=.5, bins=20, color='purple')
plt.hist(df['data_scientist'], alpha=.3, bins=20)
plt.xlabel('Interest Level')
plt.ylabel('# of Occurrences')
plt.legend(['Data Science', 'Data Scientist'])
plt.axvline(df['data_science'].mean(), color='red', linestyle='-')
plt.axvline(df['data_scientist'].mean(), color='teal', linestyle='-')
plt.title('Distribution of Interest Level\nn=96\n')

plt.show()

mean1 = df['data_science'].mean()

# print(mean1)
# plt.hist(df['computer'], alpha=.5, bins=50)
# plt.hist(df['love'], alpha=.5, bins=50)
# plt.show()

# print(df.shape)

Unfortunately, our data is far from normal...

And if we want to test our hypothesis, we'll need more data!

But first, let's see some more charts to highlight relationships. Let's look at averages by month--

In [450]:
# mean_by_quarter = df.groupby(['quarter']).mean().reset_index()
# mean_by_quarter = mean_by_quarter.sort_values('quarter')
# mean_by_quarter['data_science'] = abs(mean_by_quarter['data_science'])

# f, ax = plt.subplots(figsize=(15,6))
# sns.set_style("darkgrid")

# sns.barplot(x='data_science', y='quarter', data=mean_by_quarter)
# plt.title('\nAverage Interest Over Time, By Quarter: 2012-2019\nKeyword Term: Data Science (US)\n')
# plt.ylabel("")
# plt.xlabel("\nInterest Level (0-100)")
# plt.show()




mean_by_month = df.groupby(['month_name', 'month']).mean().reset_index()
mean_by_month = mean_by_month.sort_values('data_scientist')
mean_by_month['data_scientist'] = abs(mean_by_month['data_scientist'])

f, ax = plt.subplots(figsize=(15,6))
sns.set_style("darkgrid")
#               {"xtick.major.size":18,
#                "ytick.major.size":18})

sns.barplot(x='data_scientist', y='month_name', data=mean_by_month)
plt.title('\nAverage Interest Over Time, By Month: 2012-2019\nKeyword Term: Data Scientist (US)\n')
plt.ylabel("")
plt.xlabel("\nInterest Level (0-100)")
mean = mean_by_month['data_scientist'].mean()
median = mean_by_month['data_scientist'].median()
plt.axvline(x=mean, color='black', linestyle='--')
plt.axvline(x=median, color='blue', linestyle='--')
plt.show()







mean_by_month = df.groupby(['month_name', 'month']).mean().reset_index()
mean_by_month = mean_by_month.sort_values('data_science')
mean_by_month['data_science'] = abs(mean_by_month['data_science'])

f, ax = plt.subplots(figsize=(15,6))
sns.set_style("darkgrid")
#               {"xtick.major.size":18,
#                "ytick.major.size":18})

sns.barplot(x='data_science', y='month_name', data=mean_by_month)
plt.title('\nAverage Interest Over Time, By Month: 2012-2019\nKeyword Term: Data Science (US)\n')
plt.ylabel("")
plt.xlabel("\nInterest Level (0-100)")
mean = mean_by_month['data_science'].mean()
median = mean_by_month['data_science'].median()
plt.axvline(x=mean, color='black', linestyle='--')
plt.axvline(x=median, color='blue', linestyle='--')
plt.show()






# ds_all: average interest across the two keywords
df['ds_all'] = (df['data_science'] + df['data_scientist']) / 2


mean_by_month = df.groupby(['month_name', 'month']).mean().reset_index()
mean_by_month = mean_by_month.sort_values('ds_all')
mean_by_month['ds_all'] = abs(mean_by_month['ds_all'])

f, ax = plt.subplots(figsize=(15,6))
sns.set_style("darkgrid")
#               {"xtick.major.size":18,
#                "ytick.major.size":18})

sns.barplot(x='ds_all', y='month_name', data=mean_by_month)
plt.title('\nAverage Interest Over Time, By Month: 2012-2019\nKeyword Term: Data Scientist & Data Science (US)\n')
plt.ylabel("")
plt.xlabel("\nInterest Level (0-100)")
mean = mean_by_month['ds_all'].mean()
median = mean_by_month['ds_all'].median()
plt.axvline(x=mean, color='black', linestyle='--')
plt.axvline(x=median, color='blue', linestyle='--')

plt.show()

The bar graph shows a few things:

1. The month with the highest average interest level is September, in both keyword categories;

2. The month with the lowest average interest level is July, in both keyword categories;

3. In our control group, June was the worst month for 'Computer', and January the worst for 'Love'.

In [451]:
mean_by_month.sort_values('ds_all')
Out[451]:
month_name month data_scientist data_science computer love year weekday quarter ds_all
5 July 7 43.250 35.500 69.625 73.500 2015.5 3.000 3.0 39.3750
6 June 6 42.875 36.125 68.625 72.000 2015.5 3.625 2.0 39.5000
7 March 3 44.125 39.750 73.375 68.875 2015.5 3.500 1.0 41.9375
3 February 2 43.750 40.375 75.750 67.000 2015.5 3.250 1.0 42.0625
2 December 12 43.250 41.750 73.875 69.625 2015.5 3.750 4.0 42.5000
8 May 5 46.000 39.375 69.750 71.250 2015.5 2.375 2.0 42.6875
4 January 1 45.000 42.125 78.625 67.000 2015.5 2.875 1.0 43.5625
0 April 4 45.250 42.000 71.375 69.875 2015.5 3.000 2.0 43.6250
9 November 11 45.750 45.375 73.250 70.875 2015.5 3.500 4.0 45.5625
1 August 8 51.375 46.250 74.750 71.750 2015.5 2.500 3.0 48.8125
10 October 10 52.625 50.000 72.000 70.500 2015.5 2.250 4.0 51.3125
11 September 9 55.500 55.125 76.500 70.625 2015.5 3.750 3.0 55.3125

Here we'll see the trend increase by year, validating visually that interest in Data Science has increased over time...

In [454]:
# Group by year only; grouping by the value column as well would leave one
# row per unique (year, value) pair instead of a single yearly mean.
mean_by_year = df.groupby(['year'], as_index=False).mean()
mean_by_year['data_scientist'] = abs(mean_by_year['data_scientist'])

f, ax = plt.subplots(figsize=(15,6))
sns.set_style("darkgrid")
#               {"xtick.major.size":18,
#                "ytick.major.size":18})

sns.lineplot(x='year', y='data_scientist', data=mean_by_year)
sns.set_palette("husl",3) 
plt.title('\nAverage Interest Over Time 2012-2019\nKeyword Term: Data Scientist (US)\n')
plt.ylabel("Interest Level")
plt.xlabel("\nTime in Years")
plt.show()

# group by year only for a true yearly mean
mean_by_year = df.groupby(['year'], as_index=False).mean()
mean_by_year['data_science'] = abs(mean_by_year['data_science'])

f, ax = plt.subplots(figsize=(15,6))
sns.set_style("darkgrid")
#               {"xtick.major.size":18,
#                "ytick.major.size":18})

sns.lineplot(x='year', y='data_science', data=mean_by_year)
sns.set_palette("husl",3) 
plt.title('\nAverage Interest Over Time 2012-2019\nKeyword Term: Data Science (US)\n')
plt.ylabel("Interest Level")
plt.xlabel("\nTime in Years")
plt.show()

# group by year only for a true yearly mean
mean_by_year = df.groupby(['year'], as_index=False).mean()
mean_by_year['ds_all'] = abs(mean_by_year['ds_all'])

f, ax = plt.subplots(figsize=(15,6))
sns.set_style("darkgrid")
#               {"xtick.major.size":18,
#                "ytick.major.size":18})

sns.lineplot(x='year', y='ds_all', data=mean_by_year)
sns.set_palette("husl",3) 
plt.title('\nAverage Interest Over Time, By Year: 2012-2019\nKeyword Term: Data Scientist & Data Science (US)\n')
plt.ylabel("Interest Level")
plt.xlabel("\nTime in Years")
plt.show()
In [455]:
sns.set(style="ticks")

# jointplot creates its own figure, so a separate plt.figure() call would
# only leave an empty figure behind.
g = sns.jointplot(x='year', y='ds_all', data=df, kind="hex", color="cyan")
g.fig.suptitle('Average Interest Over Time, By Year: 2012-2019\nKeyword Term: Data Scientist & Data Science (US)')
g.set_axis_labels("\nTime in Years", "Interest Level")
plt.show()

As we saw above, the charts are pretty, and we can see an increase over time, which answers our first question. Without normality, our analysis focused more on the time variable. However, this didn't require much science... we can't run our tests on this dataset!

Next Hypothesis!--Let's see if we can answer our second question--

Hypothesis-2.png

Data_Scientist_The_Sexiest_Job_of_the_21st_Century_-_2019-12-11_22.08.29.png

Oh no! Looks like we need more years!

# Our last dataset only had 96 observations

keyword%20data%20science%20%282%29.png

Time for some science...! :)

Let's complete our analysis and experimentation in Jupyter Notebook.

Adding in data from 2004, we now have 192 observations!

For clarity, we'll stick with the Keyword "Data Scientist" to centralize our analysis within the context of career/job title vs. the subject itself.

Let's make two groups:

Group A will be everything before the Harvard Business Review article (n=106):

"All interest levels, by month, for 'Data Scientist', from Jan 1, 2004 through Oct 1, 2012"

Group B will be everything after the Harvard Business Review article (n=86):

"All interest levels, by month, for 'Data Scientist', from Nov 1, 2012 through Dec 1, 2019"

Can we see any difference in the numbers--
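A minimal sketch of that split, assuming a DataFrame `trends` with a datetime column `month` and the Trends column `data scientist: (United States)` (the column names mirror the notebook's, but the inline values are made up):

```python
import pandas as pd

# Made-up rows standing in for the 192 monthly observations.
trends = pd.DataFrame({
    "month": pd.to_datetime(
        ["2004-01-01", "2012-10-01", "2012-11-01", "2019-12-01"]),
    "data scientist: (United States)": [1, 4, 12, 78],
})

# Group A: Jan 2004 - Oct 2012 (before the HBR article)
# Group B: Nov 2012 - Dec 2019 (after the HBR article)
cutoff = pd.Timestamp("2012-11-01")
group_a = trends[trends["month"] < cutoff]
group_b = trends[trends["month"] >= cutoff]

print(len(group_a), len(group_b))
```

A single timestamp comparison against the datetime column is enough to carve the series into before/after groups.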

Here are our stats for Group A:

count 106.000000
mean 2.792453
std 2.728045
min 0.000000
25% 1.000000
50% 2.000000
75% 3.000000
max 17.000000

Here are our stats for Group B:

count 86.000000
mean 50.441860
std 25.564039
min 10.000000
25% 27.500000
50% 47.500000
75% 75.500000
max 100.000000
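For reference, the two summaries above are the output of pandas' `Series.describe()`; a minimal sketch with made-up interest levels (the real Group B series has n=86):

```python
import pandas as pd

# Hypothetical interest levels standing in for a Trends series.
interest = pd.Series([10, 27, 48, 76, 100])

# describe() reports count, mean, std, min, quartiles, and max in one call.
summary = interest.describe()
print(summary)
```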

Above we can see the difference in means between the two groups. Let's see how this looks in a box plot:

In [460]:
plt.figure(figsize=(17,10))

plt.title('Statistics for Interest Levels in Keyword "Data Scientist"\nGroup A: 2004 - 2012\n')
group_a['data scientist: (United States)'].plot.box(meanline=True)
plt.show()


plt.figure(figsize=(17,10))

plt.title('Statistics for Interest Levels in Keyword "Data Scientist"\nGroup B: 2012 - 2019\n')
group_b['data scientist: (United States)'].plot.box(meanline=True)

plt.show()

Our box plots show us two things:

1) In the first box plot, we see a lot of outliers!

  • This could be because, as our x-axis (time) increases, interest levels increase as well

2) In our second box plot, we see no outliers.

  • Perhaps this means there's not only an increase in interest level, but also more years where interest levels were higher than the average seen before the HBR article in 2012.

Let's test for normality across groups--

First, let's plot histograms for groups A and B--

In [461]:
plt.figure(figsize=(20,10))

plt.hist(group_a['data scientist: (United States)'], bins=25, alpha= .5, color='purple')
plt.hist(group_b['data scientist: (United States)'], bins=25, alpha= .5, color='cyan')
# plt.axvline(group_a['data scientist: (United States)'].mean(), color='black', linestyle='--')


plt.title('Distribution of Interest Level:\n2004-2019\nGroup A & Group B\n Group A mean=2.79\n Group B mean=50.44')
plt.legend(labels=['Group A: Before HBR article','Group B: After HBR article'], 
           loc='right', 
           handlelength=5, 
           borderpad=3, labelspacing=5)

plt.show()

The distributions of both groups do not look normal.

So let's confirm this by printing out some descriptive statistics.

In [462]:
data_nobs = len(group_a['data scientist: (United States)'])
data_mean = group_a['data scientist: (United States)'].mean()
data_min = group_a['data scientist: (United States)'].min()
data_max = group_a['data scientist: (United States)'].max()
data_var = group_a['data scientist: (United States)'].var()
data_skew = group_a['data scientist: (United States)'].skew()
data_kurtosis = group_a['data scientist: (United States)'].kurtosis()


print("Group A: Pre-Harvard Business Review Article - Jan 1 2004 - Oct 1 2012\n\n")
print("N (Sample Size): {}".format(round(data_nobs,2)))
print("Mean: {}".format(round(data_mean,2)))
print("Min: {}".format(round(data_min,2)))
print("Max: {}".format(round(data_max,2)))
print("Variance: {}".format(round(data_var,2)))
print("Skewness: {}".format(round(data_skew,2)))
print("Kurtosis: {}".format(round(data_kurtosis,2)))
Group A: Pre-Harvard Business Review Article - Jan 1 2004 - Oct 1 2012


N (Sample Size): 106
Mean: 2.79
Min: 0
Max: 17
Variance: 7.44
Skewness: 2.32
Kurtosis: 7.67

Again, these results are not in line with normality-- perhaps we can use further methods to test for normality--

Given that our test returns an extremely small p-value (p=6.21e-16), we reject the null hypothesis that the data are normally distributed.
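The cell behind that p-value isn't shown above; one common choice for this kind of check is D'Agostino and Pearson's normality test (`scipy.stats.normaltest`), which combines skewness and kurtosis. A hedged sketch on synthetic right-skewed data (the real Group A series isn't reproduced here, so the exact p-value will differ):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in: a right-skewed sample roughly the size of Group A (n=106).
rng = np.random.default_rng(0)
sample = rng.exponential(scale=3.0, size=106)

# D'Agostino-Pearson test: H0 = the sample comes from a normal distribution
stat, p = stats.normaltest(sample)
print(f"statistic={stat:.2f}, p-value={p:.3g}")

if p < 0.05:
    print("Reject H0: the sample is not consistent with normality")
```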

We should reconsider how we answer our original question...

It's very likely that the observations seen in the experiment are not reflective of the population.

But to be sure, let's perform one last test to confirm our results further

We can further test this result by using the Shapiro-Wilk test for normality:

Results

A Shapiro-Wilk test confirms that the data are not normally distributed:

p-value = 2.212...e-11

Unfortunately, this means that if we were to use the raw Google Trends data to answer our original questions, we could not claim with any confidence that the increase in trend we see is statistically significant for the general population.
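A minimal sketch of how such a Shapiro-Wilk test is run with `scipy.stats.shapiro` (synthetic skewed data standing in for the Trends series, so the p-value will not match the one above):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the 192-observation Trends series; the p-value
# quoted above came from the notebook's actual data.
rng = np.random.default_rng(42)
sample = rng.exponential(scale=25.0, size=192)

# Shapiro-Wilk: H0 = the sample is drawn from a normal distribution
w_stat, p_value = stats.shapiro(sample)
print(f"W={w_stat:.3f}, p-value={p_value:.3g}")
```

A small p-value (below the chosen alpha, e.g. 0.05) rejects normality, matching the conclusion drawn above.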

Conclusion/Discussion:

- Based on this research, it's clear that relying on raw data from Google Trends is not a scientifically sound basis for decision-making

  • Motivations
    • validation
    • Socio-economic
    • Bio-cognitive1 (demonstrates how thoughts and their biological expression coemerge within a cultural history)
      • Imposter Syndrome
    • Philosophical - Metaphysics
      • "What's real and what's not?"
      • How differentiated is the human from its machines/the internet?
    • Scientific/Technological
      • Quantization2
      • Geographic Information intelligence
      • Stationarity of Time-Series Data

Conclusion/Discussion Cont'd:

  • Opportunities for further research

    • Would like to add a map of the country
    • more "time buckets"
  • Biases

    • Search Engine Market Share
    • Radical Anonymity
  • Shifts in Perspectives

    • As communication and the spread of information have increased, our reliance on the internet to debunk life's mysteries should be examined on an interpersonal level.

Sources:

Raw Data - Interest Level over Time:

"Google Trends Data - Search Term: Data Scientist, January 1, 2004 - December 1, 2019; Interest over Time" (https://trends.google.com/trends/explore?date=2004-01-01%202019-12-01&geo=US&q=data%20scientist)

"Google Trends Data - Search Term: Data Science, January 1, 2004 - December 1, 2019; Interest over Time" (https://trends.google.com/trends/explore?date=2004-01-01%202019-12-01&geo=US&q=data%20science)

"Google Trends Data - Search Term: Data Science vs. Data Scientist, January 1, 2004 - December 1, 2019; Interest over Time" (https://trends.google.com/trends/explore?date=2004-01-01%202019-12-01&geo=US&q=data%20science,data%20scientist)

1. "Biocognitive" (https://www.biocognitive.com/index.php/philosophy/page.html)

2. "Quantization" (https://www.sciencedirect.com/topics/engineering/quantisation)

GIS data:

"Google Trends Data - Search Term: Data Scientist, January 1, 2004 - December 1, 2019; Interest by Subregion" (https://trends.google.com/trends/explore?date=2004-01-01%202019-12-01&geo=US&q=data%20scientist)

"Google Trends Data - Search Term: Data Science, January 1, 2004 - December 1, 2019; Interest by Subregion" (https://trends.google.com/trends/explore?date=2004-01-01%202019-12-01&geo=US&q=data%20science)
